20 research outputs found

    Accurate and Efficient Private Release of Datacubes and Contingency Tables

    A central problem in releasing aggregate information about sensitive data is to do so accurately while providing a privacy guarantee on the output. Recent work focuses on the class of linear queries, which include basic counting queries, data cubes, and contingency tables. The goal is to maximize the utility of their output, while giving a rigorous privacy guarantee. Most results follow a common template: pick a "strategy" set of linear queries to apply to the data, then use the noisy answers to these queries to reconstruct the queries of interest. This entails either picking a strategy set that is hoped to be good for the queries, or performing a costly search over the space of all possible strategies. In this paper, we propose a new approach that balances accuracy and efficiency: we show how to improve the accuracy of a given query set by answering some strategy queries more accurately than others. This leads to an efficient optimal noise allocation for many popular strategies, including wavelets, hierarchies, Fourier coefficients and more. For the important case of marginal queries we show that this strictly improves on previous methods, both analytically and empirically. Our results also extend to ensuring that the returned query answers are consistent with an (unknown) data set at minimal extra cost in terms of time and noise
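
The idea of answering some strategy queries more accurately than others can be illustrated with a toy model. This sketch is not the paper's algorithm. It assumes each strategy query has sensitivity 1, is answered with Laplace noise under its own budget eps_i, and that the budgets add up to the total epsilon (sequential composition). The `weights` values are hypothetical and stand for how much each strategy query's variance contributes to the error of the queries of interest. Under these assumptions, minimizing the sum of w_i / eps_i^2 subject to a fixed budget sum gives eps_i proportional to the cube root of w_i.

```python
import random


def allocate_budget(weights, epsilon):
    """Split epsilon across queries in proportion to weight ** (1/3)."""
    roots = [w ** (1.0 / 3.0) for w in weights]
    total = sum(roots)
    return [epsilon * r / total for r in roots]


def weighted_variance(weights, budgets):
    """Weighted sum of Laplace variances, 2 / eps_i**2 each (sensitivity 1)."""
    return sum(w * 2.0 / (e * e) for w, e in zip(weights, budgets))


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_answers(true_answers, budgets, rng):
    """Perturb each answer with Laplace noise of scale 1 / eps_i."""
    return [a + laplace_noise(1.0 / e, rng) for a, e in zip(true_answers, budgets)]


# Hypothetical weights: the first strategy query matters most.
weights = [8.0, 1.0, 1.0]
epsilon = 1.0
uniform = [epsilon / len(weights)] * len(weights)
tuned = allocate_budget(weights, epsilon)

if __name__ == "__main__":
    print("uniform budgets:", uniform, weighted_variance(weights, uniform))
    print("tuned budgets:  ", tuned, weighted_variance(weights, tuned))
    print(noisy_answers([10.0, 5.0, 3.0], tuned, random.Random(0)))
```

In this toy setting the tuned split spends more of the budget on the heavily weighted query, so its weighted variance is lower than that of the uniform split at the same total epsilon.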

    Rectangular Layouts and Contact Graphs

    Contact graphs of isothetic rectangles unify many concepts from applications including VLSI and architectural design, computational geometry, and GIS. Minimizing the area of their corresponding rectangular layouts is a key problem. We study the area-optimization problem and show that it is NP-hard to find a minimum-area rectangular layout of a given contact graph. We present $O(n)$-time algorithms that construct $O(n^2)$-area rectangular layouts for general contact graphs and $O(n \log n)$-area rectangular layouts for trees. (For trees, this is an $O(\log n)$-approximation algorithm.) We also present an infinite family of graphs (resp., trees) that require $\Omega(n^2)$ (resp., $\Omega(n \log n)$) area. We derive these results by presenting a new characterization of graphs that admit rectangular layouts using the related concept of rectangular duals. A corollary to our results relates the class of graphs that admit rectangular layouts to rectangle of influence drawings. Comment: 28 pages, 13 figures, 55 references, 1 appendix

    Exact and Approximation Algorithms for Clustering (Extended Abstract)

    In this paper we present an $n^{O(k^{1-1/d})}$-time algorithm for solving the k-center problem in $\mathbb{R}^d$, under $L_\infty$ and $L_2$ metrics. The algorithm extends to other metrics, and to the discrete k-center problem. We also describe a simple $(1+\epsilon)$-approximation algorithm for the k-center problem, with running time $O(n \log k) + (k/\epsilon)^{O(k^{1-1/d})}$. Finally, we present an $n^{O(k^{1-1/d})}$-time algorithm for solving the L-capacitated k-center problem, provided that $L = \Omega(n/k^{1-1/d})$ or $L = O(1)$. We conclude with a simple approximation algorithm for the L-capacitated k-center problem
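
For readers unfamiliar with the k-center objective (choose k centers so that the largest distance from any point to its nearest center is as small as possible), a minimal sketch follows. It uses the classic farthest-first traversal, a well-known 2-approximation for the discrete problem under the Euclidean metric. It is a swapped-in baseline for illustration, not the exact or (1+epsilon)-approximation algorithm described in the abstract. The function name and sample points are hypothetical.

```python
import math


def farthest_first_centers(points, k):
    """Greedy 2-approximation for discrete k-center (Euclidean metric).

    Starts from the first point, then repeatedly adds the point that is
    farthest from the centers chosen so far. Returns the centers and the
    resulting covering radius.
    """
    if not points or k < 1:
        raise ValueError("need at least one point and k >= 1")
    centers = [points[0]]
    # dist[i] = distance from points[i] to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        idx = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[idx])
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return centers, max(dist)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
    print(farthest_first_centers(pts, 2))
```

The traversal takes O(nk) distance evaluations, which makes it a convenient reference point when comparing against slower but more accurate methods.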

    Covering Points by Strips in the Plane

    Given a set S of n points in $\mathbb{R}^d$ and an integer $k > 0$, we want to cover S by k strips so that the maximum width of a strip is minimized. This problem stems from the pattern-discovering class of problems with important applications in data mining, pattern recognition, etc. Let w be the smallest value so that S can be covered by k strips, each of width at most w. Computing k strips of width even at most $Cw$, for any constant $C > 0$, that cover S is known to be NP-Complete [31], even when $d = 2$. In this paper we propose an efficient approximation algorithm for the planar case that approximates both the number of strips and the optimal width. More precisely, we present a randomized algorithm that computes $O(k \log k)$ strips of width at most $6w$ that cover S, and whose expected running time is $O(nk^2 \log^4 n)$ if $k^2 \log k \le n$. Our algorithm also works for larger values of k, but then the expected running time is $O(n^{2/3} k^{8/3} \log^4 n)$. For the case $k = d = 2$, the b..

    Summary Graphs for Relational Database Schemas

    Increasingly complex databases need ever more sophisticated tools to help users understand their schemas and interact with the data. Existing tools fall short of either providing the “big picture,” or of presenting useful connectivity information. In this paper we define summary graphs, a novel approach for summarizing schemas. Given a set of user-specified query tables, the summary graph automatically computes the most relevant tables and joins for that query set. The output preserves the most informative join paths between the query tables, while meeting size constraints. In the process, we define a novel information-theoretic measure over join edges. Unlike most subgraph extraction work, we allow meta edges (i.e., edges in the transitive closure) to help reduce output complexity. We prove that the problem is NP-Hard, and solve it as an integer program. Our extensive experimental study shows that our method returns high-quality summaries under independent quality measures

    Exact and Approximation Algorithms for Clustering

    In this paper we present an $n^{O(k^{1-1/d})}$-time algorithm for solving the k-center problem in $\mathbb{R}^d$, under $L_\infty$ and $L_2$ metrics. The algorithm extends to other metrics, and can be used to solve the discrete k-center problem as well. We also describe a simple $(1+\epsilon)$-approximation algorithm for the k-center problem, with running time $O(n \log k) + (k/\epsilon)^{O(k^{1-1/d})}$. Finally, we present an $n^{O(k^{1-1/d})}$-time algorithm for solving the L-capacitated k-center problem, provided that $L = \Omega(n/k^{1-1/d})$ or $L = O(1)$. We conclude with a simple approximation algorithm for the L-capacitated k-center problem

    PrivBayes: Private Data Release via Bayesian Networks

    Privacy-preserving data publishing is an important problem that has been the focus of extensive study. The state-of-the-art goal for this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods require injecting a prohibitive amount of noise compared to the signal in the data, which renders the published data next to useless. To address the deficiency of the existing methods, this paper presents PRIVBAYES, a differentially private method for releasing high-dimensional data. Given a dataset D, PRIVBAYES first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. After that, PRIVBAYES injects noise into each marginal in P to ensure differential privacy, and then uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PRIVBAYES samples tuples from the approximate distribution to construct a synthetic dataset, and then releases the synthetic data. Intuitively, PRIVBAYES circumvents the curse of dimensionality, as it injects noise into the low-dimensional marginals in P instead of the high-dimensional dataset D. Private construction of Bayesian networks turns out to be significantly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We experimentally evaluate PRIVBAYES on real data, and demonstrate that it significantly outperforms existing solutions in terms of accuracy
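
The noise-injection and sampling steps described above can be sketched for a single low-dimensional marginal. This is a simplified illustration under stated assumptions, not the PRIVBAYES implementation. It skips the Bayesian network construction entirely, assumes neighboring datasets differ by adding or removing one record (so a count histogram has sensitivity 1), and assumes the attribute domain is fixed in advance. All function names and the toy data are hypothetical.

```python
import itertools
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_marginal(rows, attrs, domain, epsilon, rng):
    """Noisy, normalized marginal over `attrs` for a fixed cell `domain`.

    Noise is added to every cell of the domain, including empty ones,
    then negative counts are clamped to zero and the result normalized.
    """
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    noisy = {
        cell: max(0.0, counts.get(cell, 0) + laplace_noise(1.0 / epsilon, rng))
        for cell in domain
    }
    total = sum(noisy.values())
    if total == 0.0:
        # Degenerate case: fall back to a uniform distribution.
        return {cell: 1.0 / len(domain) for cell in domain}
    return {cell: value / total for cell, value in noisy.items()}


def sample_cells(marginal, n, rng):
    """Draw n synthetic tuples from the noisy marginal."""
    cells = list(marginal)
    return rng.choices(cells, weights=[marginal[c] for c in cells], k=n)


if __name__ == "__main__":
    rng = random.Random(7)
    rows = [{"age": "young", "smoker": "no"}] * 60 + [{"age": "old", "smoker": "yes"}] * 40
    domain = list(itertools.product(["young", "old"], ["no", "yes"]))
    marginal = noisy_marginal(rows, ["age", "smoker"], domain, epsilon=1.0, rng=rng)
    print(marginal)
    print(sample_cells(marginal, 5, rng))
```

Because the noise scale depends only on epsilon and not on the number of cells, a marginal over few attributes keeps a much better signal-to-noise ratio than a histogram over the full attribute space, which is the intuition the abstract appeals to.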